We previously discussed the problem you needed answered. The first part of this report is the math behind it and the second part are simulations in how this problem would look in the real world.
The survey question asked which major city in Texas the respondent was from. These cities will be represented by the x values in the math and as probabilities in the simulations. Finding the optimal termination point (minimize terminations while reaching our 400 respondent minimum) is answered by the math, but this does not give us a clear understanding of how many surveys will be terminated for choosing this optimal point. And since minimizing terminations is an important part of this problem, the simulation gives a clear understanding on what to expect, and this depends heavily on the difference in probabilities between each question.
400 = X_1 + X_2 + X_3 + X_4
The first concern is making sure each X_i will reach at least 80, so we make the last term the minimum.
X_min = 80
400 = 3X + X_min
320 = 3X
X = 106.6667
Since there is no such thing as a partial person, we must divide this up into whole numbers
n = 400
xmin = 80
terms_left = 3
remainder = (n - xmin) # 320
# 3 terms remaining
modulo = remainder %% 3 # (320 modulo 3) = 2, so 3 divides into 320 until there is a remainder of 2.
modulo
## [1] 2
x1 <- floor(remainder/terms_left) #320/3 rounded down = 106
x1
## [1] 106
80 is the minimum value and first known value. Knowing that 320/3 is 106.67 and 320 modulo 3 = 2, we know the remaining terms will be 106.67 rounded down and 106+1 and 106+1. The modulo is the remainder that will be divided by the number of remaining terms and added to them.
The math is shown by:
x1
## [1] 106
x2 = (x1 + modulo/2);x2
## [1] 107
x3 = (x1 + modulo/2);x3
## [1] 107
xmin
## [1] 80
x1 + (x1 + modulo/2) + (x1 + modulo/2) + xmin
## [1] 400
This can be applied to any number of required surveys with a given number of required questions (x terms).
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.0
## v tibble 2.0.1 v dplyr 0.8.0.1
## v tidyr 0.8.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Simulation <- function(term_point = 120,
min_acceptable = 80,
min_surveys = 400,
p1 = .35,
p2 = .30,
p3 = .25,
p4 = .10,
maxed_reached_points = F,
final_paid = F,
complete_event = F){
### Intialization
probs = c(p1, p2, p3, p4)
nxt = 1
min_reached = F
outcome <- tibble(P1 = NA,
P2 = NA,
P3 = NA,
P4 = NA)
terminated <- outcome
### Rerun the simulation until the minimum is reached OR the minium number of surveys is met.
quota <- outcome
while(min_reached==F | (sum(colSums(quota))<min_surveys) ){
results <- matrix(t(rmultinom(1,1,probs)), ncol = 4)
results <- tibble(P1 = results[,1],
P2 = results[,2],
P3 = results[,3],
P4 = results[,4])
outcome[nxt,] <- results
outcome_total <- colSums(outcome[1:nxt,])
maxed_reached = sum(
colSums(outcome[1:(nxt),]) > term_point
)
quota <- tibble(p1 = ifelse(sum(outcome$P1)>term_point, term_point, sum(outcome$P1)),
p2 = ifelse(sum(outcome$P2)>term_point, term_point, sum(outcome$P2)),
p3 = ifelse(sum(outcome$P3)>term_point, term_point, sum(outcome$P3)),
p4 = ifelse(sum(outcome$P4)>term_point, term_point, sum(outcome$P4)))
if(maxed_reached != 0){
terminated[nxt,] <- results
}
min_reached <- (min(colSums(outcome[1:nxt,])) >= min_acceptable)
if(min_reached==FALSE | sum(colSums(quota))<min_surveys){
nxt = nxt + 1
}
}
#quota
#sum(colSums(outcome))
#(colSums(final))
maxed <- sum(is.na(rowSums(terminated, na.rm=F))) # is when the question has hit is first maximum value
over <- tibble(P1 = 0,
P2 = 0,
P3 = 0,
P4 = 0)
if(maxed_reached_points==T){
maxes <- tibble(P1 = min(which(cumsum(outcome$P1)==term_point), na.rm = T),
P2 = min(which(cumsum(outcome$P2)==term_point), na.rm = T),
P3 = min(which(cumsum(outcome$P3)==term_point), na.rm = T),
P4 = min(which(cumsum(outcome$P4)==term_point), na.rm = T))
#inf_fix <- function(x){ifelse(x==Inf, 0, x)}
#maxes$P1 <- ifelse(maxes$P1==Inf, NA, maxes$P1)
#maxes$P2 <- ifelse(maxes$P2==Inf, NA, maxes$P2)
#maxes$P3 <- ifelse(maxes$P3==Inf, NA, maxes$P3)
#maxes$P4 <- ifelse(maxes$P4==Inf, NA, maxes$P4)
}
if(maxed>1){
t1 <- terminated[!is.na(terminated$P1), ]
paid_outcomes <- outcome[1:maxed,]
extra_respondents = nrow(outcome) - maxed # extra respondents needed to reach the minimum goal
over_r = 1
for(i in 1:extra_respondents){
check <- ((colSums(bind_rows(paid_outcomes, t1[i,]))) > term_point)
if(sum(check) == 0){
paid_outcomes <- bind_rows(paid_outcomes, t1[i,])
} else {
over[over_r,] <- t1[i,]
over_r = over_r + 1
}
}##end for loop
} else {
paid_outcomes = outcome
}
output <- list(
terminated = as.numeric(sum(colSums(over))),
quota = quota)
if(complete_event == T){
output$outcome=outcome
}
if(maxed_reached_points==T){
output$maxes = maxes
}
if(final_paid == T){
output$paid_outcomes = paid_outcomes
}
return(output)
}
This simulation is ran once with the term point at 120.
term_point = 120
min_acceptable = 80
min_surveys = 400
first_simulation <- Simulation(term_point = 120,
min_acceptable = 80,
min_surveys = 400,
p1 <- .35,
p2 <- .30,
p3 <- .25,
p4 <- .10,
maxed_reached_points = T,
final_paid = T,
complete_event = T)
If the probabilities of answering each these questions are 0.35, 0.3, 0.25, 0.1, respectively, with a termination point of 120 and minimum acceptance of 80:
Total Responses
colSums(first_simulation$outcome)
## P1 P2 P3 P4
## 273 231 184 80
Totalling 768 total survey clicks.
Paid Responses
colSums(first_simulation$paid_outcomes)
## P1 P2 P3 P4
## 120 120 120 80
Totalling 440 paid responses required to reach the minimum acceptance point, given the termination point of 120.
The respondents that were terminated, by question, are:
colSums(first_simulation$outcome) - colSums(first_simulation$paid_outcomes)
## P1 P2 P3 P4
## 153 111 64 0
Totalling 328
The following simulations will test for all the termination points we discussed. Notice that the probabilities will have a greater range.
This is one of the key findings, when the probabilities of answering one question are much higher than the rest, more respondents will be terminated in general. This is imporant if we want a perfect termination point.
The probabilites that a respondent will answer any given question are 0.4, 0.1, 0.25, 0.25.
The spread of the probabilities is 0.1224745, which is the standard deviation, with a mean of .25. This standard deviation is high, which I expect will be the case in reality.
start_time <- Sys.time()
for(i in 1:nrow(testing)){
test_model <- Simulation(testing$termination_point[i],
p1 = p1,
p2 = p2,
p3 = p3,
p4 = p4,
maxed_reached_points = F,
final_paid = T,
complete_event = T)
testing[i,"terminated"] <- test_model$terminated
testing[i,"paid_surveys"] <- sum(colSums(test_model$paid_outcomes))
testing[i,c(4:7)] <- test_model$quota
}
end_time <- Sys.time()
end_time - start_time
## Time difference of 32.60181 secs
A termination point of 106 is the best for fewest number of terminated respondents.
A termination point of 107 yields the fewest number of terminated respondents and is closest to the survey allowance.
The perfect theoretical termination point is 106, when it is reached, the termination point must be switched to 107 for the rest of the surveys.
To gain a greater understanding of how probabiliites effect the optimal termination point, the simulation will be ran with 71 different sets of probabilities. These probabilities will be randomly generated, ranging from a low to high standard deviations.
p_rand <- function(mean = .25, sd = .05){
continue <- TRUE
while(continue){
params <- round(rnorm(3, mean = mean, sd = sd),3)
for(i in 1:4){
if(i<4){
assign(paste0("p", i), params[i])
} else assign(paste0("p", i), 1-sum(params))
}
sd_p <- sd(c(p1, p2, p3, p4))
if(sd_p< (sd+.01) & sd_p > (sd-.01))
continue <- FALSE
}
return(c(p1, p2, p3, p4))
}
se <- seq(.04, .11, .001)
params <- tibble(sds = se,
P1 = NA,
P2 = NA,
P3 = NA,
P4 = NA,
.rows = length(se))
for(j in seq_along(se)){
params[j,"sds"] <- round(sd(p_rand(sd = se[j])),3)
params[j,c(2:5)] <- p_rand(sd = se[j])
}
plot(params$sds)
Each dot here measures the standard deviation of each set of probablities. NOTE, if the probabilities were all 0.25, the standard deviation would be 0. The previous test used an sd of 0.1224745, which is high.
This simulation will be ran 1,491 times, with 21 termination points ranging from 100 to 120 and 71 probablility sets, with standard deviations ranging between 0.038, 0.117.
start_time <- Sys.time()
for(i in 1:length(outputs)){
outputs[[i]] <- Simulation(term_point = arguments$term_point[i],
p1 = arguments$p1[i],
p2 = arguments$p2[i],
p3 = arguments$p3[i],
p4 = arguments$p4[i])
}
end_time <- Sys.time()
end_time - start_time
## Time difference of 22.13159 mins
output_df <- flatten(outputs)
quota_ind <- which(names(output_df) == "quota")
over_ind <- which(names(output_df) == "terminated")
output_dfs <- do.call(bind_rows, output_df[quota_ind])
output_dfs$terminated <- output_df[over_ind]
simparam <- tibble(
termination_point = rep(seq(100, 120, 1), nrow(params)),
sd = rep(params$sds, each=21)
)
sim_results <- bind_cols(simparam, output_dfs)
sim_results$quota_over <- rowSums(sim_results[,c(3:6)]) - min_surveys
sim_results$terminated <- as.numeric(sim_results$terminated)
arguments_df <- do.call(rbind, arguments) %>% t() %>%
as.tibble() %>%
rename(P1 = p1, P2 = p2, P3 = p3, P4 = p4) %>%
mutate(term_point=NULL)
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
sim_results <- bind_cols(sim_results, arguments_df)
The lower the termination point, the greater the number of terminated surveys. The higher the termination point, the greater the number of paid surveys that will be needed to reach the minimum point. We see that a termination point of 106 results in 400 paid surveys, no matter the probability of each question. A termination point of 107 results in needing a maximum of 1 more survey to reach the minimum for about half of the simulation outcomes.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(sim_results, x = ~termination_point, y = ~sd, z = ~quota_over,
text = ~paste("Over Quota:", quota_over,
"P1:",P1,
"P2", P2,
"P3", P3,
"P4", P4, sep = " "), hoverinfo = 'text+x+y', type = "heatmap")
This heatmap gives us a clear representation of the data. 106 is the last termination point where the quota is not surpassed. Purple shows where the quota is met, at 400 surveys. As the color lightens, the required number of paid surveys increases inorder to reach that 80 minimun per survey.
Notice that hovering over the plot shows the randomly generated probabilities for each question. Hovering horizontally they stay the same. As mentioned before, each set of probabilities has a mean of 0.25, but have differing standard deviations, as shown through the y axis (at the top, the probabilities range much greater than below).
plot_ly(sim_results, x = ~termination_point, y = ~sd, z = ~quota_over, color = ~terminated, type = "scatter3d")
This scatterplot shows the same thing, however, it makes it more clear that as the standard deviation between question probabilities increases, so does the number of terminated responses (marked by the lightening colors).
The heatmap below is very similar, but colors the number of terminated responses. The darker the squares, the fewer the terminated responses.
plot_ly(sim_results, x = ~termination_point, y = ~sd, z = ~terminated,
text = ~paste("Over Quota:", quota_over,
"P1:",P1,
"P2", P2,
"P3", P3,
"P4", P4, sep = " "), hoverinfo = 'text+x+y', type = "heatmap")
There is a clear trend that the greater the standard deviation between the probability of a question, the more responses that will be terminated.
plot_ly(sim_results, x = ~termination_point, y = ~sd, z = ~terminated, color = ~quota_over, type = "scatter3d")
This scatterplot colors the number of surveys that were required to reach the survey minimum. A termination point of 106 shows the clear color differential. It also shows a clear positive linear relationship between the number of terminated responses and the standard deviation of question probabilities. We also see that as the termination point increases, the number of terminated responses decreases.